Web-scale Topic Models in Spark: An Asynchronous Parameter Server
Authors
Abstract
In this paper, we train a Latent Dirichlet Allocation (LDA) topic model on the ClueWeb12 data set, a 27-terabyte Web crawl. We extend Spark, a popular tool for performing large-scale data analysis, with an asynchronous parameter server. Such a parameter server provides a distributed and concurrently accessed parameter space for the model. A Metropolis-Hastings-based collapsed Gibbs sampler is implemented on top of this parameter server and achieves an amortized O(1) sampling complexity. We compare our implementation to the default Spark implementations and show that it is several orders of magnitude more scalable without sacrificing model quality. A topic model with 1,000 topics is trained on the full ClueWeb12 data set, uncovering some of the prevalent themes that appear on the Web.
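The abstract does not say how the amortized O(1) sampling cost is obtained. One common way to get it in Metropolis-Hastings LDA samplers is Walker's alias method: a table is built once in O(K) time for K topics and then reused for many constant-time draws. The sketch below illustrates that general technique only, under the assumption that something similar serves as the proposal distribution; the function names `build_alias_table` and `draw` are hypothetical and are not taken from the paper.

```python
import random


def build_alias_table(weights):
    """Build Walker alias tables for unnormalized, non-negative weights in O(K)."""
    n = len(weights)
    total = float(sum(weights))
    # Scale so that the average bucket mass is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n   # probability of keeping bucket i's own index
    alias = [0] * n    # index returned otherwise
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # Bucket l donates mass to fill bucket s up to 1.
        scaled[l] = scaled[l] + scaled[s] - 1.0
        if scaled[l] < 1.0:
            small.append(l)
        else:
            large.append(l)
    # Leftovers (including floating-point stragglers) keep their own index.
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def draw(prob, alias, rng=random):
    """Draw one index in O(1): pick a bucket uniformly, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

In a Metropolis-Hastings sampler the table can be allowed to go stale between rebuilds, since the acceptance step corrects for the mismatch between the proposal and the current target. Spreading the O(K) build cost over roughly K draws is what makes the per-draw cost amortized O(1).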
Related resources
An Efficient Threading Model to Boost Server Performance
Multi-threading remains a popular choice for server architecture. Widely used applications like the Apache web server, and the MySQL database server are written in a multi-threaded fashion. We consider thread architectures from two angles: (1) number of user threads per kernel thread, and (2) use of synchronous I/O vs. asynchronous I/O, and consider their effects on server performance. Our clai...
How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark based data analytics in scale-up configuration is not...
LightLDA: Big Topic Models on Modest Compute Clusters
When building large-scale machine learning (ML) programs, such as massive topic models or deep networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners or academic researchers. We consider this challenge in the context...
Investigation on Reliability Estimation of Loosely Coupled Software as a Service Execution Using Clustered and Non-Clustered Web Server
Evaluating the reliability of loosely coupled Software as a Service through the paradigm of a cluster-based and non-cluster-based web server is considered to be an important attribute for the service delivery and execution. We proposed a novel method for measuring the reliability of Software as a Service execution through load testing. The fault count of the model against the stresses of users ...
Consistent Bounded-Asynchronous Parameter Servers for Distributed ML
In distributed ML applications, shared parameters are usually replicated among computing nodes to minimize network overhead. Therefore, a proper consistency model must be carefully chosen to ensure the algorithm's correctness and provide high throughput. Existing consistency models used in general-purpose databases and modern distributed ML systems are either too loose to guarantee correctness of the ...
Journal title: CoRR
Volume: abs/1605.07422
Publication date: 2016